Abstract: This paper considers estimation of the predictive density for
a normal linear model with unknown variance under $\alpha$-divergence loss
for $-1 \le \alpha \le 1$. We first give a general canonical form for the
problem, and then give general expressions for the generalized Bayes
solution under the above loss for each $\alpha$. For a particular class of
hierarchical generalized priors studied in Maruyama and Strawderman
(2005, 2006) for the problems of estimating the mean vector and the
variance respectively, we give the generalized Bayes predictive density.
Additionally, we show that, for a subclass of these priors, the resulting
estimator dominates the generalized Bayes estimator with respect to the
right invariant prior when $\alpha = 1$, i.e., the best (fully) equivariant
minimax estimator.
1. Introduction

We begin with the standard normal linear regression model setup

$$y \sim N_n(X\beta, \sigma^2 I_n), \tag{1.1}$$

where $y$ is an $n \times 1$ vector of observations, $X$ is an $n \times k$
matrix of $k$ potential predictors where $n > k$ and $\mathrm{rank}\, X = k$,
$\beta$ is a $k \times 1$ vector of unknown regression coefficients, and
$\sigma^2$ is the unknown variance. Based on observing $y$, we consider the
problem of giving the predictive density $p(\tilde y \mid \beta, \sigma^2)$
of a future $m \times 1$ vector $\tilde y$ where

$$\tilde y \sim N_m(\tilde X\beta, \sigma^2 I_m). \tag{1.2}$$

Here $\tilde X$ is a fixed $m \times k$ design matrix of the same $k$
potential predictors in $X$, and the rank of $\tilde X$ is assumed to be
$\min(m, k)$. We also assume that $y$ and $\tilde y$ are conditionally
independent given $\beta$ and $\sigma^2$. Note that in most earlier papers on
such prediction problems, $\sigma^2$ is assumed known, partly because this
typically makes the problem less difficult. However, the assumption of
unknown variance is more realistic, and we treat this more difficult case in
this paper. In the following we denote by $\psi$ all the unknown parameters
$\{\beta, \sigma^2\}$.
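For each value of $y$, a predictive estimate $\hat p(\tilde y; y)$ of
$p(\tilde y \mid \psi)$ is often evaluated by the Kullback-Leibler (KL)
divergence

$$D_{KL}\{\hat p(\tilde y; y), p(\tilde y \mid \psi)\} = \int p(\tilde y \mid \psi) \log \frac{p(\tilde y \mid \psi)}{\hat p(\tilde y; y)} \, d\tilde y \tag{1.3}$$

which is called the KL divergence loss from $p(\tilde y \mid \psi)$ to
$\hat p(\tilde y; y)$. The overall quality of the procedure
$\hat p(\tilde y; y)$ for each $\psi$ is then conveniently summarized by the
KL risk

$$R_{KL}(\hat p(\tilde y; y), \psi) = \int D_{KL}\{\hat p(\tilde y; y), p(\tilde y \mid \psi)\} \, p(y \mid \psi) \, dy \tag{1.4}$$

where $p(y \mid \psi)$ is the density of $y$ in (1.1). Aitchison (1975)
showed that the Bayesian solution with respect to the prior $\pi(\psi)$
under the loss $D_{KL}$ given by (1.3) is what is called the Bayesian
predictive density

$$\hat p_\pi(\tilde y; y) = \frac{\int p(\tilde y \mid \psi)\, p(y \mid \psi)\, \pi(\psi)\, d\psi}{\int p(y \mid \psi)\, \pi(\psi)\, d\psi} = \int p(\tilde y \mid \psi)\, \pi(\psi \mid y)\, d\psi \tag{1.5}$$

where

$$\pi(\psi \mid y) = \frac{p(y \mid \psi)\, \pi(\psi)}{\int p(y \mid \psi)\, \pi(\psi)\, d\psi}.$$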

For prediction problems in general, many studies suggest the use of the
Bayesian predictive density rather than plug-in densities of the form
$p(\tilde y \mid \hat\psi(y))$, where $\hat\psi$ is an estimated value of
$\psi$. In our setup of the problem, Liang and Barron (2004) showed that the
Bayesian predictive density with respect to the right invariant prior is the
best equivariant and minimax. Although the Bayesian predictive density with
respect to the right invariant prior is a good default procedure, it has
been shown to be inadmissible in some cases. Specifically, when $\sigma^2$
is assumed to be known and the following are assumed

$$m \ge k \ge 3, \quad n = mN, \quad X = 1_N \otimes \tilde X = (\tilde X', \dots, \tilde X')', \tag{AS1}$$

where $N$ is a positive integer, $1_N$ is an $N \times 1$ vector each
component of which is one, and $\otimes$ is the Kronecker product, Komaki
(2001) showed that the shrinkage Bayesian predictive density with respect to
the harmonic prior

$$\pi_{S,0}(\psi) = \pi(\beta) = \{\beta' X'X \beta\}^{1-k/2} \tag{1.6}$$

dominates the best invariant Bayesian predictive density with respect to

$$\pi_{I,0}(\psi) = \pi(\beta) = 1. \tag{1.7}$$

George, Liang and Xu (2006) extended the result of Komaki (2001) to general
shrinkage priors including the prior of Strawderman (1971). As pointed out
above, we assume that the variance $\sigma^2$ is unknown in this paper. The
first decision-theoretic result in the unknown variance case was derived by
Kato (2009). He showed that, under the same assumption (AS1) as in Komaki
(2001), the Bayesian predictive density with respect to the shrinkage prior

$$\pi_{S,1}(\psi) = \pi(\beta, \sigma^2) = \{\beta' X'X \beta\}^{1-k/2} \{\sigma^2\}^{-2} \tag{1.8}$$

dominates the best invariant predictive density, which is the Bayesian
predictive density with respect to the right invariant prior

$$\pi_{I,1}(\psi) = \pi(\beta, \sigma^2) = \{\sigma^2\}^{-1}. \tag{1.9}$$

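From a more general viewpoint, the KL loss given by (1.3) is in the class of
$\alpha$-divergences introduced by Csiszar (1967) and defined by

$$D_\alpha\{\hat p(\tilde y; y), p(\tilde y \mid \psi)\} = \int f_\alpha\!\left(\frac{\hat p(\tilde y; y)}{p(\tilde y \mid \psi)}\right) p(\tilde y \mid \psi)\, d\tilde y \tag{1.10}$$

where

$$f_\alpha(z) = \begin{cases} \dfrac{4}{1-\alpha^2}\{1 - z^{(1+\alpha)/2}\} & |\alpha| < 1 \\ z \log z & \alpha = 1 \\ -\log z & \alpha = -1. \end{cases}$$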
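As a purely illustrative sketch, the divergence can be evaluated for two discrete distributions, with the integral in (1.10) replaced by a finite sum. The branch used for $|\alpha| < 1$ is the standard form $4(1 - z^{(1+\alpha)/2})/(1-\alpha^2)$, which is an assumption here, as are the function names and the example probabilities.

```python
import math

def f_alpha(z, alpha):
    """Generator of the alpha-divergence; the |alpha| < 1 branch is the assumed standard form."""
    if alpha == 1:
        return z * math.log(z)
    if alpha == -1:
        return -math.log(z)
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - z ** ((1.0 + alpha) / 2.0))

def alpha_divergence(p_hat, p, alpha):
    """Discrete analogue of (1.10): sum_i f_alpha(p_hat_i / p_i) * p_i."""
    return sum(f_alpha(q / r, alpha) * r for q, r in zip(p_hat, p))

# Hypothetical example distributions (not from the paper).
p_true = [0.5, 0.3, 0.2]
p_est = [0.4, 0.4, 0.2]

# Direct KL divergence from p_true to p_est, for comparison with alpha = -1.
kl = sum(r * math.log(r / q) for q, r in zip(p_est, p_true))
d_minus_one = alpha_divergence(p_est, p_true, -1)
d_zero = alpha_divergence(p_est, p_true, 0.0)
```

In this sketch `d_minus_one` agrees with `kl`, and values of `alpha` close to -1 give divergences close to the KL value.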

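Clearly the KL loss given by (1.3) corresponds to $D_{-1}$. Corcuera and
Giummole (1999) showed that a generalized Bayesian predictive density under
$D_\alpha$ is

$$\hat p_{\pi,\alpha}(\tilde y; y) \propto \begin{cases} \left[\int p^{(1-\alpha)/2}(\tilde y \mid \psi)\, \pi(\psi \mid y)\, d\psi\right]^{2/(1-\alpha)} & \alpha \ne 1 \\ \exp\left\{\int \log p(\tilde y \mid \psi)\, \pi(\psi \mid y)\, d\psi\right\} & \alpha = 1. \end{cases} \tag{1.11}$$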

Hence the Bayesian predictive density of the form (1.5) may not be good
under $\alpha$-divergence with $\alpha \ne -1$. But as Brown (1979) pointed
out in the estimation problem, decision-theoretic properties often seem to
depend on the general structure of the problem (the general type of problem
(location, scale), and the dimension of the parameter space) and on the
prior in a Bayesian setup, but not on the loss function. In fact, we will
show that, under the assumption (AS1) and the $D_1$ loss, the predictive
density with respect to the same shrinkage prior given by (1.8) improves on
the best invariant predictive density with respect to (1.9) (see Section 4).
From this viewpoint, we are generally interested in how robust the Stein
effect already found under $D_\alpha$ loss for a specific $\alpha$ is under
$D_\alpha$ loss for general $\alpha$. For example, we can pose some concrete
problems as follows.

Problem 1 Under the assumption (AS1) and the $D_\alpha$ loss for
$-1 < \alpha < 1$, does the predictive density with respect to the same
shrinkage prior given by (1.8) improve on the best invariant predictive
density with respect to (1.9)?

Problem 2-1 Under $D_1$ loss, even if $k = 1, 2$, the best invariant
predictive density remains inadmissible because an improved non-Bayesian
predictive density is easily found (see Section 4). Can we find improved
Bayesian predictive densities for this case ($k = 1, 2$)?

Problem 2-2 Under $k = 1, 2$ and the $D_\alpha$ loss with
$-1 \le \alpha < 1$, does the best invariant predictive density remain
inadmissible? If so, which Bayesian predictive density improves on it?

In this paper, the main focus is on Problems 2-1 and 2-2. For Problem 2-1,
we give an exact solution. We could not solve Problem 2-2 in this paper, but
by a natural extension of the shrinkage prior considered for Problem 2-1
($D_1$ loss), we provide a class of predictive densities which we hope will
lead to a solution in future work. In addition, Problem 1 remains open.

The organization of this paper is as follows. We treat not only simple
design matrices like (AS1) but also the general ones noted at the beginning
of this section. In order to make the structure of our problem clearer,
Section 2 gives the canonical form of the problem. In Section 3, we consider
a natural extension of a hierarchical prior which was originally proposed in
Strawderman (1971) and Maruyama and Strawderman (2005) for the problem of
estimating $\beta$. Using it, we construct a Bayesian predictive density
under $D_\alpha$ loss for $-1 \le \alpha < 1$ and $\alpha = 1$. In Section
4, we show that a subclass of the Bayesian predictive densities proposed in
Section 3 is minimax under $D_1$ loss even if $k$ is small. Section 5 gives
concluding remarks.

2. A canonical form 

In this section, we reduce the problem to a canonical form. To simplify
expressions and to make matters a bit clearer it is helpful to rotate the
problem via the following transformation. First we note that for the
observation $y$, sufficient statistics are

$$\hat\beta_U = (X'X)^{-1}X'y \sim N_k(\beta, \sigma^2 (X'X)^{-1}), \qquad S = \|(I - X(X'X)^{-1}X')y\|^2 \sim \sigma^2 \chi^2_{n-k},$$

where $\hat\beta_U$ and $S$ are independent.

Case I: When $m \ge k$, let $M$ be a nonsingular $k \times k$ matrix which
simultaneously diagonalizes the matrices $X'X$ and $\tilde X'\tilde X$,
where $M$ satisfies

$$M'(X'X)^{-1}M = \mathrm{diag}(d_1, \dots, d_k), \qquad MM' = \tilde X'\tilde X$$

where $d_1 \ge \dots \ge d_k$. Let $V = M'\hat\beta_U$ and $\theta = M'\beta$.

Case II: When $m < k$, there exists a $(k-m) \times k$ matrix $\tilde X_*$
such that $(\tilde X', \tilde X_*')'$ is a $k \times k$ nonsingular matrix
and also $\tilde X (X'X)^{-1} \tilde X_*'$ is an $m \times (k-m)$ zero
matrix. Further there exists an $m \times m$ orthogonal matrix $P$ which
diagonalizes $\sigma^2 \tilde X (X'X)^{-1} \tilde X'$, the covariance matrix
of $\tilde X \hat\beta_U$, i.e.,

$$P' \tilde X (X'X)^{-1} \tilde X' P = \mathrm{diag}(d_1, \dots, d_m)$$

where $d_1 \ge \dots \ge d_m$. There also exists a $(k-m) \times (k-m)$
matrix $P_*$ such that

$$P_*' \tilde X_* (X'X)^{-1} \tilde X_*' P_* = I_{k-m}.$$

Then $V$ and $V_*$ where

$$V = P' \tilde X \hat\beta_U, \qquad V_* = P_*' \tilde X_* \hat\beta_U$$

are independent and have multivariate normal distributions
$N_m(P'\tilde X\beta, \sigma^2 D)$ and
$N_{k-m}(P_*'\tilde X_*\beta, \sigma^2 I_{k-m})$ respectively. Let
$\theta = P'\tilde X\beta$ and $\mu = P_*'\tilde X_*\beta$.

In summary, a canonical form of the prediction problem is as follows. We
observe

$$V \sim N_l(\theta, \eta^{-1}D), \quad V_* \sim N_{k-l}(\mu, \eta^{-1}I), \quad \eta S \sim \chi^2_{n-k} \tag{2.1}$$

where $\eta = \sigma^{-2}$, $l = \min(k, m)$,
$D = \mathrm{diag}(d_1, \dots, d_l)$ and $d_1 \ge \dots \ge d_l$. When
$m \ge k$, $V_*$ is empty. Then the problem is to give a predictive density
of an $m$-dimensional future observation

$$\tilde y \sim N_m(Q\theta, \eta^{-1} I_m) \tag{2.2}$$

where $Q$ is an $m \times l$ matrix, which is given by

$$Q = \begin{cases} P & \text{if } m < k \\ \tilde X (M')^{-1} & \text{if } m \ge k, \end{cases}$$

and hence satisfies $Q'Q = I_l$. Notice that, under the assumption given by
(AS1), $D$ becomes $N^{-1}I_k$, $V_*$ is empty, and $Q$ becomes
$\tilde X(\tilde X'\tilde X)^{-1/2}$.

The distribution of $\tilde y$ in (2.2) is the same as in (1.2), so it is
just the $y$'s that have been transformed. In the remainder of the paper, we
will consider the problem in its canonical form, (2.1) and (2.2). We will
use the notation $\hat p(\tilde y \mid y)$ in the following although it may
be more appropriate to use $\hat p(\tilde y \mid v, v_*, s)$ or
$\hat p(\tilde y \mid \hat\beta_U, s)$.

3. A class of generalized Bayes predictive densities 

In this section, we consider the following class of hierarchical prior
densities, $\pi(\theta, \mu, \eta)$, for the canonical model given by (2.1)
and (2.2):

$$\begin{aligned} \theta \mid \eta, \lambda &\sim N_l\left(0, \eta^{-1}(D^{-1} + \{(1-\alpha)/2\} I_l)^{-1}(C/\lambda - I_l)\right) \\ \mu \mid \eta, \lambda &\sim N_{k-l}\left(0, \eta^{-1}(\gamma/\lambda - 1) I_{k-l}\right) \\ \eta &\propto \eta^{a}, \quad \lambda \propto \lambda^{a}(1-\lambda)^{b} I_{(0,1)}(\lambda), \end{aligned} \tag{3.1}$$

where $C = \mathrm{diag}(c_1, \dots, c_l)$ with $c_i \ge 1$ for
$1 \le i \le l$, $b = b(\alpha) = (1-\alpha)m/4 + (n-k)/2 - 1$ and
$\gamma \ge 1$. The integral which appears in the Bayesian predictive
density below will be well-defined when $a > -k/2 - 1$. An essentially
equivalent class was considered for the problems of estimating $\theta$ and
$\sigma^2$ in Maruyama and Strawderman (2005, 2006) respectively. When
$m \ge k$, the prior on $\mu$ is empty and we have only to eliminate
$\|v_*\|^2/\gamma$ from the representation of the Bayesian solution in
Theorems 3.1 and 3.2 below in order to have the corresponding result.
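3.1. Case i: $\alpha \in [-1, 1)$

Theorem 3.1. The generalized Bayes predictive density under $D_\alpha$
divergence with respect to the prior (3.1) is given by

$$\hat p_\alpha(\tilde y \mid y) \propto \hat p_{\{U,\alpha\}}(\tilde y \mid y) \times \tilde p_\alpha(\tilde y \mid y), \tag{3.2}$$

where

$$\begin{aligned} \hat p_{\{U,\alpha\}}(\tilde y \mid y) &= \left\{(\tilde y - Qv)'\Sigma_U^{-1}(\tilde y - Qv) + s\right\}^{-m/2-(n-k)/(1-\alpha)} \\ \tilde p_\alpha(\tilde y \mid y) &= \left\{(\tilde y - Q\hat\theta_B)'\Sigma_B^{-1}(\tilde y - Q\hat\theta_B) + R(v) + \|v_*\|^2/\gamma + s\right\}^{-(k+2a+2)/(1-\alpha)} \end{aligned} \tag{3.3}$$

and where

$$\begin{aligned} \Sigma_U &= \{2/(1-\alpha)\} I + QDQ' \\ \hat\theta_B &= (C - I)(C + \{(1-\alpha)/2\}D)^{-1} v \\ \Sigma_B &= \{2/(1-\alpha)\} I + Q(C - I)D(C + \{(1-\alpha)/2\}D)^{-1}Q' \\ R(v) &= v'(\{(1-\alpha)/2\}D + I)D^{-1}(C + \{(1-\alpha)/2\}D)^{-1} v. \end{aligned} \tag{3.4}$$

Proof. See Appendix. □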

The first term $\hat p_{\{U,\alpha\}}(\tilde y \mid y)$ is the best
invariant predictive density, and is Bayes with respect to the right
invariant prior $\pi(\theta, \mu, \eta) = \eta^{-1}$. Upon normalizing,
$\hat p_{\{U,\alpha\}}(\tilde y \mid y)$ is multivariate-t with mean
$Qv = \tilde X\hat\beta_U$. We omit the straightforward calculation. Liang
and Barron (2004) show that $\hat p_{\{U,\alpha\}}(\tilde y \mid y)$ has a
constant minimax risk.

The second term, $\tilde p_\alpha$, is a pseudo multivariate-t density with
mean vector $Q\hat\theta_B$. Since $\|\hat\theta_B\| \le \|v\|$ is clearly
satisfied, $\tilde p_\alpha$ induces a shrinkage effect toward the origin.
The complexity in the second term is reduced considerably with the choice
$C = I$, in which case $\hat\theta_B = 0$, $\Sigma_B = \{2/(1-\alpha)\}I$
and $R(v) = v'D^{-1}v$. However, since the covariance matrix of $v$,
$\sigma^2 D$, is diagonal but not necessarily a multiple of $I$, the
introduction of $C \ne I$ seems reasonable. Indeed in the context of ridge
regression, Casella (1980) and Maruyama and Strawderman (2005) have argued
that shrinking unstable components more than stable components is
reasonable. An ascending sequence of $c_i$'s leads to this end. Hence the
complexity, while perhaps not pleasing, is nevertheless potentially useful.

3.2. Case ii: $\alpha = 1$

Theorem 3.2. The generalized Bayes predictive distribution under $D_1$
divergence with respect to the prior (3.1) is the normal distribution
$N_m(Q\hat\theta_{\nu,C}, \hat\sigma^2_{\nu,C} I_m)$ where

$$\hat\theta_{\nu,C} = \left(I - \frac{\nu}{\nu + 1 + W} C^{-1}\right) V, \qquad \hat\sigma^2_{\nu,C} = \left(1 - \frac{\nu}{\nu + 1 + W}\right) \frac{S}{n-k}$$

and where $W = \{V'C^{-1}D^{-1}V + \|V_*\|^2/\gamma\}/S$ and
$\nu = (k + 2a + 2)/(n - k)$.

Proof. See Appendix. □
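The plug-in estimators of Theorem 3.2 are simple to compute once the canonical statistics are available. The following is a minimal sketch, assuming the diagonal matrices $C$ and $D$ are passed as lists of their diagonal entries; the function name, argument names and numerical inputs are hypothetical.

```python
def plug_in_estimates(v, v_star, s, c, d, a, gamma, n, k):
    """Return (theta_hat, sigma2_hat) in the form stated in Theorem 3.2.

    v, c, d: lists of length l (canonical statistic and diagonals of C and D);
    v_star:  list of length k - l (empty when m >= k);
    s:       residual sum of squares; a, gamma: hyperparameters of the prior.
    """
    nu = (k + 2 * a + 2) / (n - k)
    quad = sum(vi * vi / (ci * di) for vi, ci, di in zip(v, c, d))
    w = (quad + sum(x * x for x in v_star) / gamma) / s
    shrink = nu / (nu + 1 + w)
    theta_hat = [(1 - shrink / ci) * vi for vi, ci in zip(v, c)]
    sigma2_hat = (1 - shrink) * s / (n - k)
    return theta_hat, sigma2_hat

# Hypothetical inputs with l = k = 2 (so V_* is empty) and n = 12.
theta_hat, sigma2_hat = plug_in_estimates(
    v=[1.0, 2.0], v_star=[], s=10.0, c=[1.0, 2.0], d=[1.0, 0.5],
    a=0.0, gamma=1.0, n=12, k=2)
```

With these inputs $\nu = 0.4$ and $W = 0.5$, so the common shrinkage factor is $0.4/1.9$; the second coordinate, which has the larger $c_i$, is shrunk by half that proportion.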

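It is quite interesting to note that the Bayesian predictive density
$\hat p_\alpha(\tilde y \mid y)$ for $\alpha \in [-1, 1)$ given in Section
3.1 converges to
$\phi_m(\tilde y, Q\hat\theta_{\nu,C}, \hat\sigma^2_{\nu,C})$ as
$\alpha \to 1$, where $\phi_m(\cdot, \xi, \tau^2)$ denotes the $m$-variate
normal density with mean vector $\xi$ and covariance matrix $\tau^2 I_m$.

Since the Bayes solution is the plug-in predictive density as shown in
Theorem 3.2, we pay attention only to the properties of plug-in predictive
densities under the loss $D_1$. The $\alpha$-divergence with $\alpha = 1$,
from $\phi_m(\tilde y, Q\hat\theta, \hat\sigma^2)$, the predictive normal
density with plug-in estimators $\hat\theta$ and $\hat\sigma^2$, to
$\phi_m(\tilde y, Q\theta, \sigma^2)$, the true normal density, is given by

$$\begin{aligned} D_1\{\phi_m(\tilde y, Q\hat\theta, \hat\sigma^2), \phi_m(\tilde y, Q\theta, \sigma^2)\} &= \int \log \frac{\phi_m(\tilde y, Q\hat\theta, \hat\sigma^2)}{\phi_m(\tilde y, Q\theta, \sigma^2)} \, \phi_m(\tilde y, Q\hat\theta, \hat\sigma^2)\, d\tilde y \\ &= \int \left\{ -\frac{m}{2}\log\frac{\hat\sigma^2}{\sigma^2} - \frac{\|\tilde y - Q\hat\theta\|^2}{2\hat\sigma^2} + \frac{\|\tilde y - Q\theta\|^2}{2\sigma^2} \right\} \phi_m(\tilde y, Q\hat\theta, \hat\sigma^2)\, d\tilde y \\ &= \frac{1}{2}\left\{ L_1(\hat\theta, \theta, \sigma^2) + m L_2(\hat\sigma^2, \sigma^2) \right\}. \end{aligned} \tag{3.5}$$

In (3.5), $L_1$ denotes the scale invariant quadratic loss

$$L_1(\hat\theta, \theta, \sigma^2) = \frac{(\hat\theta - \theta)'(\hat\theta - \theta)}{\sigma^2}$$

for $\theta$ and $L_2$ denotes Stein's or entropy loss

$$L_2(\hat\sigma^2, \sigma^2) = \frac{\hat\sigma^2}{\sigma^2} - \log\frac{\hat\sigma^2}{\sigma^2} - 1$$

for $\sigma^2$. Hence when the prediction problem under $\alpha$-divergence
with $\alpha = 1$ is considered from the Bayesian point of view, the
Bayesian solution is the normal distribution with plug-in Bayes estimators
and the prediction problem reduces to the simultaneous estimation problem of
$\theta$ and $\sigma^2$ under the sum of losses as in (3.5).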
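The decomposition of the $D_1$ loss into a quadratic part and an entropy part can be checked numerically in the simplest case. The sketch below assumes $m = l = 1$ and $Q = 1$, and compares the closed form $\{L_1 + mL_2\}/2$ with a midpoint-rule integration of $\int \hat\phi \log(\hat\phi/\phi)$; all names and numerical values are illustrative only.

```python
import math

def d1_loss(theta_hat, theta, sigma2_hat, sigma2, m):
    """Closed form (1/2){L1 + m L2} of the D_1 loss between plug-in and true normals."""
    l1 = sum((th - t) ** 2 for th, t in zip(theta_hat, theta)) / sigma2
    ratio = sigma2_hat / sigma2
    l2 = ratio - math.log(ratio) - 1.0
    return 0.5 * (l1 + m * l2)

def d1_loss_numeric_1d(mean_hat, mean, sigma2_hat, sigma2,
                       lo=-15.0, hi=15.0, steps=30000):
    """Midpoint-rule value of the integral of phi_hat * log(phi_hat / phi) in one dimension."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        log_hat = -0.5 * math.log(2 * math.pi * sigma2_hat) \
                  - (y - mean_hat) ** 2 / (2 * sigma2_hat)
        log_true = -0.5 * math.log(2 * math.pi * sigma2) \
                   - (y - mean) ** 2 / (2 * sigma2)
        total += math.exp(log_hat) * (log_hat - log_true)
    return total * h

closed = d1_loss([0.3], [0.0], 1.5, 1.0, m=1)
numeric = d1_loss_numeric_1d(0.3, 0.0, 1.5, 1.0)
```

The two values agree to within the quadrature error, and the loss vanishes when the plug-in estimates equal the true parameters.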



4. Improved minimax predictive densities under $D_1$

Y. Maruyama and W. Strawderman/Bayesian predictive densities

In this section, we give analytical results on minimaxity under $D_1$ loss.
As pointed out in the previous section, the prediction problem under $D_1$
loss reduces to the simultaneous estimation problem of $\theta$ and
$\sigma^2$ under the sum of losses as in (3.5). Clearly the UMVU estimators
of $\theta$ and $\sigma^2$ are $\hat\theta_U = V$ and
$\hat\sigma^2_U = S/(n-k)$. These are also generalized Bayes estimators with
respect to the right invariant prior $\pi(\theta, \mu, \eta) = \eta^{-1}$
and are hence minimax. The constant minimax risk is given by
$\mathrm{MR}_{\theta,\sigma^2}$ where

$$\mathrm{MR}_{\theta,\sigma^2} = \frac{1}{2}\mathrm{tr}\,D + \frac{m}{2}\left\{\log\gamma - \frac{\Gamma'(\gamma)}{\Gamma(\gamma)}\right\} \tag{4.1}$$

and $\gamma = (n-k)/2$.
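For a concrete sense of the size of the constant risk, the following sketch evaluates $\frac12 \mathrm{tr}\,D + \frac{m}{2}\{\log\gamma - \Gamma'(\gamma)/\Gamma(\gamma)\}$ with $\gamma = (n-k)/2$. The standard library has no digamma function, so a small helper based on the recurrence $\psi(x+1) = \psi(x) + 1/x$ and the usual asymptotic series is assumed; the helper and the example inputs are not from the paper.

```python
import math

def digamma(x):
    """Digamma function for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv \
              - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def minimax_risk(d, m, n, k):
    """Constant risk of the UMVU pair under the D_1 loss, following the form of (4.1)."""
    g = (n - k) / 2.0
    return 0.5 * sum(d) + 0.5 * m * (math.log(g) - digamma(g))

# Hypothetical example: l = m = 1, D = (1), n - k = 2 residual degrees of freedom.
risk = minimax_risk([1.0], m=1, n=3, k=1)
```

Since $\log\gamma > \Gamma'(\gamma)/\Gamma(\gamma)$ for every $\gamma > 0$, the variance part of the risk is always positive and decreases as the residual degrees of freedom grow.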

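Recall that from the observation $y$, there exist independent sufficient
statistics given by (2.1):

$$V \sim N_l(\theta, \eta^{-1}D), \quad V_* \sim N_{k-l}(\mu, \eta^{-1}I), \quad \eta S \sim \chi^2_{n-k}$$

where $\eta = \sigma^{-2}$, $l = \min(k, m)$,
$D = \mathrm{diag}(d_1, \dots, d_l)$ and $d_1 \ge \dots \ge d_l$. When
$m \ge k$, $V_*$ is empty.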

In the variance estimation problem of $\sigma^2$ under $L_2$, Stein (1964)
showed that $S/(n-k)$ is dominated by

$$\hat\sigma^2_{ST} = \min\left(\frac{S}{n-k}, \frac{V'D^{-1}V + S}{n-k+l}\right) \tag{4.2}$$

for any combination of $\{n, k, m\}$ including even $l = \min(k, m) = 1$.
Hence, in the simultaneous estimation problem of $\theta$ and $\sigma^2$, we
easily see that $\{\hat\theta_U, \hat\sigma^2_U\}$ is dominated by
$\{\hat\theta_U, \hat\sigma^2_{ST}\}$ and hence have the following result.

Proposition 4.1. The estimator $\{\hat\theta_U, \hat\sigma^2_U\}$ is
inadmissible for any combination of $\{n, k, m\}$.
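A short sketch of a Stein-type truncated variance estimator may help fix ideas. It assumes the usual Stein (1964) form $\min\{S/(n-k), (V'D^{-1}V + S)/(n-k+l)\}$ with $l$ the length of $V$; the function name and inputs are hypothetical.

```python
def stein_variance(v, d, s, n, k):
    """Stein-type estimator: the smaller of S/(n-k) and (V'D^{-1}V + S)/(n-k+l)."""
    l = len(v)
    quad = sum(vi * vi / di for vi, di in zip(v, d))
    return min(s / (n - k), (quad + s) / (n - k + l))

# Hypothetical inputs: l = k = 2, n = 12, S = 10, so S/(n-k) = 1.
near_origin = stein_variance([0.0, 0.0], [1.0, 1.0], 10.0, n=12, k=2)
far_from_origin = stein_variance([5.0, 5.0], [1.0, 1.0], 10.0, n=12, k=2)
```

When $V$ is close to the origin the pooled estimate is used and is smaller than $S/(n-k)$; when $V$ is far from the origin the estimator falls back to $S/(n-k)$.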

The improved solution, $\{\hat\theta_U, \hat\sigma^2_{ST}\}$, is
unfortunately not Bayes. When $l \ge 3$ and

$$l - 2 \le 2\left(\sum_{i=1}^{l} \frac{d_i}{d_1} - 2\right) \tag{4.3}$$

we can construct a Bayesian solution using our earlier studies as follows.
In the estimation problem of $\theta$ under $L_1$, Maruyama and Strawderman
(2005) showed that the generalized Bayes estimator of $\theta$ with respect
to the harmonic-type prior

$$\pi_{S,1}(\theta, \eta) = \{\theta' D^{-1}\theta\}^{1-l/2} \tag{4.4}$$

improves on the UMVU estimator $\hat\theta_U$ when $l \ge 3$ and (4.3) is
satisfied. In the variance estimation problem of $\sigma^2$ under $L_2$,
although Maruyama and Strawderman (2006) did not state so explicitly, they
showed that the generalized Bayes estimator of $\sigma^2$ with respect to
the same prior (4.4) dominates the UMVU estimator $\hat\sigma^2_U$ when
$l \ge 3$. Hence the prior (4.4) gives an improved Bayesian solution in the
simultaneous estimation problem of $\theta$ and $\sigma^2$ when $l \ge 3$
and (4.3) is satisfied. (Note that under the special assumption (AS1)
introduced in Section 1, $D$ becomes a multiple of the identity matrix and
hence (4.3) is automatically satisfied.)
However, in the above construction of the Bayesian solution, two
assumptions, $l \ge 3$ and (4.3), are needed. Further, even if $m < k$ and
$V_*$ exists, the Bayes procedure does not depend on $V_*$. This is not
desirable because the statistic $V_*$ has some information about $\eta$ or
$\sigma^2$. In fact, the Stein-type estimator of variance

$$\hat\sigma^2_{ST*} = \min\left(\frac{S}{n-k}, \frac{\|V_*\|^2 + S}{n-l}\right) \tag{4.5}$$

as well as $\hat\sigma^2_{ST}$ dominates $\hat\sigma^2_U$, and hence
$\{\hat\theta_U, \hat\sigma^2_{ST*}\}$ also dominates
$\{\hat\theta_U, \hat\sigma^2_U\}$ in the simultaneous estimation problem.

Now we show that a subclass of the generalized Bayes procedures under $D_1$
given in Section 3.2 improves on the generalized Bayes procedure with
respect to the right invariant prior. We assume neither $l \ge 3$ nor (4.3).
Additionally the proposed procedure does depend on $V_*$ if it exists.

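Theorem 4.1. The generalized Bayes estimators of Theorem 3.2,

$$\hat\theta_{\nu,C} = \left(I - \frac{\nu}{\nu + 1 + W} C^{-1}\right) V, \qquad \hat\sigma^2_{\nu,C} = \left(1 - \frac{\nu}{\nu + 1 + W}\right) \frac{S}{n-k}$$

where $W = \{V'C^{-1}D^{-1}V + \|V_*\|^2/\gamma\}/S$, dominate the UMVU
estimators ($V$ and $S/(n-k)$) under the loss (3.5) if $\gamma \ge 1$ and
$0 < \nu \le \min(\nu_1, \nu_2, \nu_3)$ where

$$\begin{aligned} \nu_1 &= 4\, \frac{\sum (d_i/c_i) - 2\max(d_i/c_i) + m/(n-k)}{2\max(d_i/c_i)(n-k+2) + m} \\ \nu_2 &= \frac{4\{\sum (d_i/c_i) - \max(d_i/c_i)\} + 2m/(n-k)}{(n-k-2)\max(d_i/c_i) + m} \\ \nu_3 &= \frac{4}{m} \sum \frac{d_i}{c_i}. \end{aligned}$$

Proof. See Appendix. □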

Clearly $\nu_2$ and $\nu_3$ are always positive. Now consider $\nu_1$.
Assume $\nu_1$ is negative for fixed $C_0$. But there exists $g_0 > 1$ such
that $C = g_0 C_0$ makes $\nu_1$ positive. Hence we can freely choose an
ascending sequence of $c_i$'s which guarantees the minimaxity of
$(\hat\theta_{\nu,C}, \hat\sigma^2_{\nu,C})$ and increased shrinkage of
unstable components.
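The effect of rescaling $C$ on the bound $\nu_1$ can be illustrated numerically. The sketch below evaluates $\nu_1$, $\nu_2$ and $\nu_3$ as they appear in Theorem 4.1, with $D$ and $C$ given by their diagonals; the function name and the example values of $D$, $C_0$, $m$, $n$ and $k$ are hypothetical.

```python
def nu_bounds(d, c, m, n, k):
    """Return (nu1, nu2, nu3) of Theorem 4.1 for diagonal D and C given as lists."""
    ratios = [di / ci for di, ci in zip(d, c)]
    total, rmax, df = sum(ratios), max(ratios), n - k
    nu1 = 4 * (total - 2 * rmax + m / df) / (2 * rmax * (df + 2) + m)
    nu2 = (4 * (total - rmax) + 2 * m / df) / ((df - 2) * rmax + m)
    nu3 = 4 * total / m
    return nu1, nu2, nu3

d = [4.0, 1.0]          # unequal variances: the first component dominates
c0 = [1.0, 1.0]         # C_0 = I gives a negative nu1 in this example
before = nu_bounds(d, c0, m=2, n=12, k=2)

g0 = 20.0               # rescaling C = g0 * C_0 makes nu1 positive here
after = nu_bounds(d, [g0 * ci for ci in c0], m=2, n=12, k=2)
```

In this example $\nu_1 < 0$ for $C_0 = I$, while any $g_0 > 15$ gives $\nu_1 > 0$, because the term $m/(n-k)$ in the numerator does not shrink with $g_0$ whereas the other two terms do.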
Remark 4.1. We make some comments about domination results under the $D_1$
loss for the case of a known variance, say $\sigma^2 = 1$. By (2.1) and
(3.5), the prediction problem under the $D_1$ loss reduces to the problem of
estimating an $l$-dimensional mean vector $\theta$ under the quadratic loss
$L_1(\hat\theta, \theta) = \|\theta - \hat\theta\|^2$ in the case where
there exists a sufficient statistic $V \sim N_l(\theta, D)$. It is well
known that the UMVU estimator $V$ is admissible when $l = 1, 2$ and
inadmissible when $l \ge 3$. Minimax admissible estimators for $l \ge 3$
have been proposed by many researchers including Strawderman (1971), Berger
(1976), Fourdrinier, Strawderman and Wells (1998), and Maruyama (2003). On
the other hand, for KL (i.e. $D_{-1}$) loss, George, Liang and Xu (2006)
used some techniques including the heat equation and Stein's identity, and
eventually found a new identity which links KL risk reduction to Stein's
unbiased estimate of risk reduction. Based on the link, they obtained
sufficient conditions on the Bayesian predictive density for minimaxity.
Hence we expect that there should exist an analogous relationship between
the prediction problem under the $D_\alpha$ loss for $|\alpha| < 1$ and the
problem of estimating the mean vector. As far as we know, this is still an
open problem.

5. Concluding remarks 

In this paper we have studied the construction and behavior of generalized
Bayes predictive densities for normal linear models with unknown variance
under $\alpha$-divergence loss. In particular we have shown that the best
equivariant (Bayes under the right invariant prior) and minimax predictive
density under $D_1$ is inadmissible in all dimensions and for all residual
degrees of freedom. We have found a class of improved hierarchical
generalized Bayes procedures, which gives a solution to Problem 2-1 of
Section 1.

The domination results in this paper are closely related to those in
Maruyama and Strawderman (2005, 2006) for the respective problems of
estimating the mean vector under quadratic loss and the variance under
Stein's loss. In fact a key observation that aids in the current development
is that the Bayes estimator under $D_1$ loss is a plug-in normal density
with mean vector and variance closely related to those of the above papers,
and that the $D_1$ loss is the sum of a quadratic loss in the mean and
Stein's loss for the variance.

We expect that an extension of the hierarchical prior given in Section 3.1,
for the prediction problem under the $D_\alpha$ loss for
$-1 \le \alpha < 1$, can form a basis to solve Problem 2-2 of Section 1. To
date, unfortunately, we have been less successful in extending the
domination results to the full class of $\alpha$-divergence losses.

Appendix A: Appendix section 
A.1. Proof of Theorem 3.1 

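The Bayesian predictive density $\hat p_\alpha(\tilde y \mid y)$ under the
divergence $D_\alpha$ for general $\alpha \in [-1, 1)$ is proportional to

$$\left[\iiint p^{(1-\alpha)/2}(\tilde y \mid \theta, \eta)\, p(v \mid \theta, \eta)\, p(v_* \mid \mu, \eta)\, p(s \mid \eta)\, \pi(\theta, \mu, \eta)\, d\theta\, d\mu\, d\eta\right]^{2/(1-\alpha)} \tag{A.1}$$

and hence the integral in brackets is, up to a multiplicative constant,
concretely written as

$$\begin{aligned} \iiiint\ & \eta^{(1-\alpha)m/4} \exp\left(-\frac{(1-\alpha)\eta}{4}\|\tilde y - Q\theta\|^2\right) \eta^{(n-k)/2} \exp\left(-\frac{\eta s}{2}\right) \\ & \times \eta^{k/2} \exp\left(-\frac{\eta}{2}(v - \theta)'D^{-1}(v - \theta)\right) \exp\left(-\frac{\eta}{2}\|v_* - \mu\|^2\right) \\ & \times \frac{\lambda^{l/2}\eta^{l/2}}{\prod_{i}(c_i - \lambda)^{1/2}} \exp\left(-\frac{\eta}{2}\theta'(D^{-1} + \{(1-\alpha)/2\}I)(C/\lambda - I)^{-1}\theta\right) \\ & \times \left(\frac{\lambda\eta}{\gamma - \lambda}\right)^{(k-l)/2} \exp\left(-\frac{\eta}{2}\frac{\lambda\|\mu\|^2}{\gamma - \lambda}\right) \eta^{a}\lambda^{a}(1-\lambda)^{b}\, d\theta\, d\mu\, d\eta\, d\lambda. \end{aligned} \tag{A.2}$$

To aid in the simplification of the integration with respect to $\theta$, we
first re-express those terms involving $\theta$ by completing the square,
neglecting, for now, the factor $\eta(1-\alpha)/4$. Let
$D_* = \{(1-\alpha)/2\}D$. Then

$$\begin{aligned} & \|\tilde y - Q\theta\|^2 + (v - \theta)'D_*^{-1}(v - \theta) + \theta'(I + D_*^{-1})(C/\lambda - I)^{-1}\theta \\ &= \theta'(I + D_*^{-1})(I - C^{-1}\lambda)^{-1}\theta - 2\theta'(Q'\tilde y + D_*^{-1}v) + \|\tilde y\|^2 + v'D_*^{-1}v \\ &= \{\theta - (I + D_*^{-1})^{-1}(I - C^{-1}\lambda)(Q'\tilde y + D_*^{-1}v)\}'\{(I + D_*^{-1})(I - C^{-1}\lambda)^{-1}\} \\ &\qquad \times \{\theta - (I + D_*^{-1})^{-1}(I - C^{-1}\lambda)(Q'\tilde y + D_*^{-1}v)\} \\ &\quad - (Q'\tilde y + D_*^{-1}v)'\{(I + D_*^{-1})^{-1}(I - C^{-1}\lambda)\}(Q'\tilde y + D_*^{-1}v) + \|\tilde y\|^2 + v'D_*^{-1}v. \end{aligned}$$

The "residual term",

$$-(Q'\tilde y + D_*^{-1}v)'\{(I + D_*^{-1})^{-1}(I - C^{-1}\lambda)\}(Q'\tilde y + D_*^{-1}v) + \|\tilde y\|^2 + v'D_*^{-1}v,$$

may be expressed as $A + \lambda\{B - A\}$ where

$$\begin{aligned} A &= A(\tilde y, v, D_*, Q) \\ &= \|\tilde y\|^2 + v'D_*^{-1}v - (Q'\tilde y + D_*^{-1}v)'(I + D_*^{-1})^{-1}(Q'\tilde y + D_*^{-1}v) \\ &= \{2/(1-\alpha)\}(\tilde y - Qv)'\Sigma_U^{-1}(\tilde y - Qv), \end{aligned} \tag{A.3}$$

where $\Sigma_U$ is given by (3.4) and

$$\begin{aligned} B &= B(\tilde y, v, C, D_*, Q) \\ &= \|\tilde y\|^2 + v'D_*^{-1}v - (Q'\tilde y + D_*^{-1}v)'(I + D_*^{-1})^{-1}(I - C^{-1})(Q'\tilde y + D_*^{-1}v) \\ &= \{2/(1-\alpha)\}\{(\tilde y - Q\hat\theta_B)'\Sigma_B^{-1}(\tilde y - Q\hat\theta_B)\} \\ &\quad + \{2/(1-\alpha)\}\{v'(\{(1-\alpha)/2\}D + I)D^{-1}(C + \{(1-\alpha)/2\}D)^{-1}v\}, \end{aligned} \tag{A.4}$$

where $\hat\theta_B$ and $\Sigma_B$ are given by (3.4). The third equalities
in (A.3) and (A.4) will be proved in Lemma A.1 below. Similarly we may
re-express the terms involving $\mu$ as

$$\|v_* - \mu\|^2 + \frac{\lambda\|\mu\|^2}{\gamma - \lambda} = \frac{\gamma}{\gamma - \lambda}\left\|\mu - (1 - \lambda/\gamma)v_*\right\|^2 + \frac{\lambda}{\gamma}\|v_*\|^2.$$

After integration with respect to $\theta$ and $\mu$, the integral given by
(A.2) is proportional to

$$\begin{aligned} & \int_0^1\!\!\int_0^\infty \eta^{(1-\alpha)m/4 + n/2 + a}\, \lambda^{k/2+a}(1-\lambda)^{(1-\alpha)m/4 + (n-k)/2 - 1} \\ &\qquad \times \exp\left(-\frac{\eta}{2}\left\{\frac{1-\alpha}{2}\left(A + \lambda(B - A)\right) + s + \frac{\lambda}{\gamma}\|v_*\|^2\right\}\right) d\eta\, d\lambda \\ &\propto \int_0^1 \lambda^{k/2+a}(1-\lambda)^{(1-\alpha)m/4 + (n-k)/2 - 1} \\ &\qquad \times \left\{\frac{1-\alpha}{2}A + s + \lambda\left(\frac{1-\alpha}{2}(B - A) + \frac{\|v_*\|^2}{\gamma}\right)\right\}^{-(1-\alpha)m/4 - n/2 - a - 1} d\lambda. \end{aligned} \tag{A.5}$$

Note that in an identity given by Maruyama and Strawderman (2005) (see page
1758),

$$\int_0^1 \lambda^{\alpha}(1-\lambda)^{\beta}(1 + w\lambda)^{-\gamma}\, d\lambda = (1+w)^{-\alpha-1}\int_0^1 t^{\alpha}(1-t)^{\beta}\left\{1 - \frac{tw}{1+w}\right\}^{-\alpha-\beta+\gamma-2} dt, \tag{A.6}$$

the integral on the right-hand side reduces to the beta function
$\mathrm{Be}(\alpha+1, \beta+1)$ when $-\alpha - \beta + \gamma - 2 = 0$.
(In (A.6) only, $\alpha$, $\beta$ and $\gamma$ denote generic exponents.)
Hence the integral (A.5) is exactly proportional to

$$\left(\frac{1-\alpha}{2}A + s\right)^{-(1-\alpha)m/4 - (n-k)/2}\left(\frac{1-\alpha}{2}B + \frac{\|v_*\|^2}{\gamma} + s\right)^{-k/2 - a - 1}. \tag{A.7}$$

Since the Bayesian predictive density $\hat p_\alpha(\tilde y \mid y)$ with
respect to the prior $\pi(\theta, \mu, \eta)$ is proportional to the
integral (A.7) to the $2/(1-\alpha)$ power, the theorem follows.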
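The beta-integral reduction used in the last step can be checked numerically. The sketch below compares a midpoint-rule evaluation of $\int_0^1 \lambda^{p}(1-\lambda)^{q}(1+w\lambda)^{-(p+q+2)}\,d\lambda$ with the closed form $\mathrm{Be}(p+1, q+1)(1+w)^{-p-1}$; the names `p`, `q` and the chosen values are illustrative and stand for the generic exponents in the identity.

```python
import math

def lhs_integral(p, q, w, steps=20000):
    """Midpoint rule for the integral with third exponent equal to p + q + 2."""
    g = p + q + 2
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += lam ** p * (1 - lam) ** q * (1 + w * lam) ** (-g)
    return total * h

def beta_fn(x, y):
    """Beta function computed through log-gamma."""
    return math.exp(math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y))

def closed_form(p, q, w):
    """Be(p + 1, q + 1) * (1 + w)^(-p - 1)."""
    return beta_fn(p + 1, q + 1) * (1 + w) ** (-p - 1)

numeric_value = lhs_integral(2.0, 3.0, 3.0)
exact_value = closed_form(2.0, 3.0, 3.0)
```

For $p = 2$, $q = 3$ and $w = 3$ the closed form equals $\frac{1}{60}\cdot\frac{1}{64} = 1/3840$, and the quadrature value matches it to well within the discretization error.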

Lemma A.1. Let $F$ and $D_*$ be diagonal matrices. The matrix $Q$ is assumed
to satisfy $Q'Q = I$. Then

$$G(\tilde y, v, F, D_*, Q) = \|\tilde y\|^2 + v'D_*^{-1}v - (Q'\tilde y + D_*^{-1}v)'(I + D_*^{-1})^{-1}F(Q'\tilde y + D_*^{-1}v)$$

is re-expressed as

$$\begin{aligned} & \{\tilde y - QF(I + D_*(I - F))^{-1}v\}'\{I + QFD_*(I + D_*(I - F))^{-1}Q'\}^{-1} \\ &\qquad \times \{\tilde y - QF(I + D_*(I - F))^{-1}v\} \\ &\quad + v'(D_* + I)(I - F)D_*^{-1}(I + D_*(I - F))^{-1}v. \end{aligned}$$

Proof. The function $G(\tilde y, v, F, D_*, Q)$ is re-expressed as

$$G = \tilde y'\{I - Q(I + D_*^{-1})^{-1}FQ'\}\tilde y - 2\tilde y'Q(I + D_*)^{-1}Fv + v'D_*^{-1}\{I - (I + D_*)^{-1}F\}v.$$

Since

$$(I - Q(I + D_*^{-1})^{-1}FQ')^{-1} = I + QFD_*(I + D_*(I - F))^{-1}Q', \tag{A.8}$$

we obtain

$$\{I + QFD_*(I + D_*(I - F))^{-1}Q'\}Q(I + D_*)^{-1}F = QF(I + D_*(I - F))^{-1}$$

and

$$F(I + D_*)^{-1}Q'\{I + QFD_*(I + D_*(I - F))^{-1}Q'\}Q(I + D_*)^{-1}F = F^2(I + D_*)^{-1}(I + D_*(I - F))^{-1}.$$

Hence we have

$$\begin{aligned} G &= \{\tilde y - QF(I + D_*(I - F))^{-1}v\}'\{I + QFD_*(I + D_*(I - F))^{-1}Q'\}^{-1} \\ &\qquad \times \{\tilde y - QF(I + D_*(I - F))^{-1}v\} \\ &\quad - v'F^2(I + D_*)^{-1}(I + D_*(I - F))^{-1}v + v'D_*^{-1}\{I - (I + D_*)^{-1}F\}v. \end{aligned}$$

Since the matrix for the quadratic form of $v$ in the "residual term" may be
written as

$$D_*^{-1}\{I - (I + D_*)^{-1}F\} - F^2(I + D_*)^{-1}(I + D_*(I - F))^{-1} = (D_* + I)(I - F)D_*^{-1}(I + D_*(I - F))^{-1},$$

the lemma follows. □
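Because $F$ and $D_*$ are diagonal, the lemma can be sanity-checked in the scalar case $l = m = 1$ with $Q = 1$. The sketch below evaluates the original expression for $G$ and the re-expressed form at a few arbitrary points; the function names and test values are hypothetical.

```python
def g_direct(y, v, f, d):
    """Scalar G: y^2 + v^2/d - (y + v/d)^2 * f / (1 + 1/d), with Q = 1."""
    return y * y + v * v / d - (y + v / d) ** 2 * f / (1 + 1 / d)

def g_lemma(y, v, f, d):
    """Scalar version of the re-expressed form in Lemma A.1, with Q = 1."""
    t = 1 + d * (1 - f)                 # I + D_*(I - F)
    centre = f * v / t                  # Q F (I + D_*(I - F))^{-1} v
    scale = 1 + f * d / t               # I + Q F D_* (I + D_*(I - F))^{-1} Q'
    remainder = v * v * (d + 1) * (1 - f) / (d * t)
    return (y - centre) ** 2 / scale + remainder

# Arbitrary test points (y, v, f, d) with d > 0 and 0 <= f <= 1.
points = [(1.3, -0.7, 0.4, 2.5), (0.2, 1.1, 0.9, 0.3), (-2.0, 0.5, 1.0, 1.0)]
gaps = [abs(g_direct(*pt) - g_lemma(*pt)) for pt in points]
```

The case $f = 1$ corresponds to the quantity $A$ and $f = 1 - 1/c$ to the quantity $B$ in the proof of Theorem 3.1.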

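A.2. Proof of Theorem 3.2

The Bayes predictive density $\hat p_\alpha(\tilde y \mid y)$ under the
divergence $D_\alpha$ for $\alpha = 1$ is proportional to

$$\begin{aligned} & \exp\left\{\iiint \log p(\tilde y \mid \theta, \eta)\, \pi(\theta, \mu, \eta \mid v, v_*, s)\, d\theta\, d\mu\, d\eta\right\} \\ &\propto \exp\left\{\iiint \left(-\frac{\eta\|\tilde y - Q\theta\|^2}{2}\right) \pi(\theta, \mu, \eta \mid v, v_*, s)\, d\theta\, d\mu\, d\eta\right\} \\ &\propto \exp\left\{-\frac{E[\eta \mid v, v_*, s]}{2}\left\|\tilde y - Q\frac{E[\eta\theta \mid v, v_*, s]}{E[\eta \mid v, v_*, s]}\right\|^2\right\}, \end{aligned} \tag{A.9}$$

where the posterior density $\pi(\theta, \mu, \eta \mid v, v_*, s)$ is
proportional to
$p(v \mid \theta, \eta)\, p(v_* \mid \mu, \eta)\, p(s \mid \eta)\, \pi(\theta, \mu, \eta)$.
Hence the Bayes solution with respect to the prior density
$\pi(\theta, \mu, \eta)$ under $D_1$ is the plug-in normal density

$$\hat p_\alpha(\tilde y \mid y) = \phi_m(\tilde y, Q\hat\theta_\pi, \hat\sigma^2_\pi)$$

where $\phi_m(\cdot, Q\hat\theta_\pi, \hat\sigma^2_\pi)$ denotes the
$m$-variate normal density with mean vector $Q\hat\theta_\pi$ and covariance
matrix $\hat\sigma^2_\pi I_m$ and where $\hat\theta_\pi$ and
$\hat\sigma^2_\pi$ are given by

$$\hat\theta_\pi = \frac{E[\eta\theta \mid y]}{E[\eta \mid y]} = v - \frac{D\nabla_v m(v, v_*, s)}{2\{\partial/\partial s\}m(v, v_*, s)}, \qquad \hat\sigma^2_\pi = \frac{1}{E[\eta \mid y]} = -\frac{m(v, v_*, s)}{2\{\partial/\partial s\}m(v, v_*, s)}, \tag{A.10}$$

and where $m(v, v_*, s)$ is the marginal density given by

$$m(v, v_*, s) = \iiint p(v \mid \theta, \eta)\, p(v_* \mid \mu, \eta)\, p(s \mid \eta)\, \pi(\theta, \mu, \eta)\, d\theta\, d\mu\, d\eta. \tag{A.11}$$

Now we consider the marginal density of $(v, v_*, s)$ with respect to the
prior $\pi(\theta, \mu, \eta)$, (3.1) with $\alpha = 1$. Using essentially
the same calculations as in Section 3.1, and dropping the factor
$s^{(n-k)/2-1}$ of $p(s \mid \eta)$, which does not involve the parameters,
we obtain the marginal density in the relatively simple form

$$m(v, v_*, s) \propto s^{-(n-k)/2}\left(v'C^{-1}D^{-1}v + \|v_*\|^2/\gamma + s\right)^{-(k/2+a+1)}. \tag{A.12}$$

From the expression in (A.10), a straightforward calculation gives the
estimators of $\theta$ and $\sigma^2$ in the simple form

$$\hat\theta_{\nu,C} = \left(I - \frac{\nu}{\nu + 1 + W}C^{-1}\right)V, \qquad \hat\sigma^2_{\nu,C} = \left(1 - \frac{\nu}{\nu + 1 + W}\right)\frac{S}{n-k}, \tag{A.13}$$

where $W = \{V'C^{-1}D^{-1}V + \|V_*\|^2/\gamma\}/S$, respectively. This
completes the proof.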



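A.3. Proof of Theorem 4.1

Maruyama and Strawderman (2005) showed that, under the $L_1$ loss, the risk
function of a general shrinkage estimator

$$\hat\theta_\phi = \left(I - \frac{\phi(W)}{W}C^{-1}\right)V$$

with suitable $\phi$ is given by

$$\begin{aligned} E\left[L_1(\hat\theta_\phi, \theta, \sigma^2)\right] &= E\left[\frac{\|V - \theta\|^2}{\sigma^2}\right] + E\left[\frac{\phi(W)}{W}\left\{\psi(V, V_*, C, D, \gamma)\left[(n-k+2)\phi(W) \right.\right.\right. \\ &\qquad \left.\left.\left. + 4\left(1 - \frac{W\phi'(W)}{\phi(W)}(1 + \phi(W))\right)\right] - 2\sum_{i=1}^{l}\frac{d_i}{c_i}\right\}\right] \end{aligned}$$

where $\psi(v, v_*, C, D, \gamma)$ is given by

$$\psi(v, v_*, C, D, \gamma) = \frac{v'C^{-2}v}{v'C^{-1}D^{-1}v + \|v_*\|^2/\gamma}.$$

For $\phi_\nu(w) = \nu w/(\nu + 1 + w)$, we have

$$(n-k+2)\phi(w) + 4\left(1 - \frac{w\phi'(w)}{\phi(w)}(1 + \phi(w))\right) = \frac{\{(n-k+2)\nu + 4\}w^2 + (\nu+1)\{\nu(n-k-2) + 4\}w}{(1 + \nu + w)^2}$$

which is always positive when $n - k - 2 \ge 0$. Since $\psi$ is bounded
from above by $\max_{1 \le i \le l} d_i/c_i$, the risk function of
$\hat\theta_{\nu,C}$ satisfies

$$\begin{aligned} E\left[L_1(\hat\theta_{\nu,C}, \theta, \sigma^2)\right] \le \mathrm{MR}_\theta + E\left[\frac{\nu}{1 + \nu + W}\left\{ -2\sum\frac{d_i}{c_i} + \max\frac{d_i}{c_i}\left(\frac{\{(n-k+2)\nu + 4\}W^2}{(1+\nu+W)^2} \right.\right.\right. \\ \left.\left.\left. + \frac{(\nu+1)\{\nu(n-k-2) + 4\}W}{(1+\nu+W)^2}\right)\right\}\right] \end{aligned}$$

where $\mathrm{MR}_\theta = \mathrm{tr}\,D$.

Next we consider the risk function of
$\hat\sigma^2_\phi = (1 - \phi(W)/W)S/(n-k)$ where $0 < \phi(w)/w < 1$,
which is given by

$$\begin{aligned} E\left[L_2(\hat\sigma^2_\phi, \sigma^2)\right] &= E\left[\left(1 - \frac{\phi(W)}{W}\right)\frac{S}{(n-k)\sigma^2} - \log\left\{\left(1 - \frac{\phi(W)}{W}\right)\frac{S}{(n-k)\sigma^2}\right\} - 1\right] \\ &= \mathrm{MR}_{\sigma^2} + E\left[-\frac{\phi(W)}{W}\frac{S}{\sigma^2}\frac{1}{n-k} - \log\left(1 - \frac{\phi(W)}{W}\right)\right] \end{aligned}$$

where $\mathrm{MR}_{\sigma^2} = \log\gamma - \Gamma'(\gamma)/\Gamma(\gamma)$
and $\gamma = (n-k)/2$. By the chi-square identity (see e.g. Efron and
Morris (1976)), we have

$$E\left[\frac{\phi(W)S}{W\sigma^2}\right] = E\left[(n-k+2)\frac{\phi(W)}{W} - 2\phi'(W)\right].$$

Also using the relation

$$-\log(1 - x) = \sum_{i=1}^{\infty}\frac{x^i}{i} \le x + \frac{x}{2}\,\frac{x}{1-x}$$

for $0 < x < 1$, we have

$$E\left[L_2(\hat\sigma^2_\phi, \sigma^2)\right] \le \mathrm{MR}_{\sigma^2} + E\left[\frac{\phi(W)}{W}\left\{-\frac{2}{n-k}\left(1 - \frac{W\phi'(W)}{\phi(W)}\right) + \frac{1}{2}\max_w\frac{\phi(w)/w}{1 - \phi(w)/w}\right\}\right].$$

For $\phi(w) = \nu w/(\nu + 1 + w)$, we have

$$E\left[L_2(\hat\sigma^2_{\nu,C}, \sigma^2)\right] \le \mathrm{MR}_{\sigma^2} + E\left[\frac{\nu}{1 + \nu + W}\left\{-\frac{2}{n-k}\frac{W}{1 + \nu + W} + \frac{\nu}{2}\right\}\right].$$

Hence

$$\frac{1}{2}E\left[L_1(\hat\theta_{\nu,C}, \theta, \sigma^2)\right] + \frac{m}{2}E\left[L_2(\hat\sigma^2_{\nu,C}, \sigma^2)\right] \le \mathrm{MR}_{\theta,\sigma^2} - \frac{\nu}{2}E\left[\frac{\psi(W)}{(1 + \nu + W)^3}\right]$$

where $\mathrm{MR}_{\theta,\sigma^2}$ is the minimax risk given by (4.1) and

$$\begin{aligned} \psi(W) &= \frac{W^2}{2}\left(4\left\{\sum\frac{d_i}{c_i} - 2\max\frac{d_i}{c_i} + \frac{m}{n-k}\right\} - \nu\left\{2\max\frac{d_i}{c_i}(n-k+2) + m\right\}\right) \\ &\quad + (\nu+1)W\left(4\left\{\sum\frac{d_i}{c_i} - \max\frac{d_i}{c_i}\right\} + \frac{2m}{n-k} - \nu\left\{(n-k-2)\max\frac{d_i}{c_i} + m\right\}\right) \\ &\quad + \frac{(1+\nu)^2}{2}\left(4\sum\frac{d_i}{c_i} - m\nu\right). \end{aligned}$$

Each of the three coefficients of $\psi(W)$ is nonnegative when
$0 < \nu \le \min(\nu_1, \nu_2, \nu_3)$. Hence the theorem follows.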
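The two algebraic steps in this proof that are easiest to get wrong can be checked numerically. The sketch below verifies (i) the rational identity for $\phi_\nu(w) = \nu w/(\nu+1+w)$ and (ii) that combining the $L_1$ and $L_2$ bounds reproduces a quadratic in $W$ whose coefficients change sign exactly at $\nu_1$, $\nu_2$ and $\nu_3$. Here `df` stands for $n-k$, `total` for $\sum d_i/c_i$ and `rmax` for $\max d_i/c_i$; all names and numerical values are illustrative.

```python
def phi_identity_gap(w, nu, df):
    """Difference between the two sides of the identity for phi_nu."""
    phi = nu * w / (nu + 1 + w)
    dphi = nu * (nu + 1) / (nu + 1 + w) ** 2
    lhs = (df + 2) * phi + 4 * (1 - (w * dphi / phi) * (1 + phi))
    rhs = (((df + 2) * nu + 4) * w ** 2
           + (nu + 1) * (nu * (df - 2) + 4) * w) / (1 + nu + w) ** 2
    return lhs - rhs

def psi_raw(w, nu, df, m, total, rmax):
    """psi(W) assembled directly from the L1 and L2 bounds."""
    t = 1 + nu + w
    num = ((df + 2) * nu + 4) * w ** 2 + (nu + 1) * (nu * (df - 2) + 4) * w
    return -(t ** 2 * (-2 * total + m * nu / 2) + rmax * num
             - (2 * m / df) * w * t)

def psi_factored(w, nu, df, m, total, rmax):
    """psi(W) written as a quadratic in W with the three grouped coefficients."""
    c2 = 0.5 * (4 * (total - 2 * rmax + m / df)
                - nu * (2 * rmax * (df + 2) + m))
    c1 = (nu + 1) * (4 * (total - rmax) + 2 * m / df
                     - nu * ((df - 2) * rmax + m))
    c0 = 0.5 * (1 + nu) ** 2 * (4 * total - m * nu)
    return c2 * w ** 2 + c1 * w + c0

grid = [0.1, 0.5, 1.0, 3.0, 10.0]
identity_gaps = [abs(phi_identity_gap(w, 0.3, 10)) for w in grid]
psi_gaps = [abs(psi_raw(w, 0.3, 10, 2, 0.25, 0.2)
                - psi_factored(w, 0.3, 10, 2, 0.25, 0.2)) for w in grid]
```

Both lists of gaps are zero up to rounding error on the chosen grid.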